Abstract 

Adaptivity, both of the individual agents and of the in- 
teraction structure among the agents, seems indispens- 
able for scaling up multi-agent systems (MAS's) in noisy 
environments. One important consideration in design- 
ing adaptive agents is choosing their action spaces to 
be as amenable as possible to machine learning tech- 
niques, especially to reinforcement learning (RL) tech- 
niques |2^] . One important way to have the interaction 
structure connecting agents itself be adaptive is to have 
the intentions and/or actions of the agents be in the in- 
put spaces of the other agents, much as in Stackelberg 
games [§, 0, H, |||. We consider both kinds of adap- 
tivity in the design of a MAS to control network packet 
routing [§l[ §, 0, @ We demonstrate on the OPNET 
event-driven network simulator the perhaps surprising 
fact that simply changing the action space of the agents 
to be better suited to RL can result in very large im- 
provements in their potential performance: at their best 
settings, our learning-amenable router agents achieve 
throughputs up to three and one half times better than 
that of the standard Bellman-Ford routing algorithm, 
even when the Bellman-Ford protocol traffic is main- 
tained. We then demonstrate that much of that poten- 
tial improvement can be realized by having the agents 
learn their settings when the agent interaction structure 
is itself adaptive. 

1 Introduction 

As time goes on, larger and larger multi-agent systems 
(MAS's) are being deployed as a way to meet a single 
overall goal for an underlying system, and this is 
being done for increasingly noisy and unreliable sys- 
tems HI, U|, H © II- However if one uses 
traditional "hand-tailoring" to design all aspects of a 
MAS, maintaining robustness while scaling up to large 
problems becomes increasingly difficult. Accordingly, it 
is becoming imperative to understand how best to have 
both the individual agents and the structure of their 
interactions be as adaptive as possible. 

In designing agents to be adaptive one should cast 
their action spaces in a form that is as amenable as pos- 
sible to machine learning techniques, especially to rein- 
forcement learning (RL) techniques p^ i |25| . However 
it is often the case that more than just the policies of 
the individual agents needs to be adaptive; for the sys- 
tem to perform well, often the very structure with which 
the agents interact also needs to be adaptive rather than 
hard-wired. 
One way to have that structure 
be adaptive is to exploit the existence of input spaces 
in RL-based agents by having the intentions and ac- 
tions of the agents be in the input spaces of the other 
agents, much as in Stackelberg games |||, [l5|, [l(| 
In this way as individual agents adapt their policies, 
information concerning the best way to adapt to those 
new policies is automatically propagated to the other 
agents. 

We consider both kinds of adaptivity in the design of 
a MAS to control network packet routing with the goal 
of maximizing throughput ||, |l2[ In this domain 
the naive choice of the action space of each router 
agent is the categorical variable of the single outbound 
link along which to route the particular packet currently 
at the top of its queue. However in comparison to Eu- 
clidean variables with their inherent smoothness struc- 
ture, such a categorical variable is usually poorly suited 
to RL techniques. Accordingly, we instead consider hav- 
ing the action space be the vector of the proportion 
of packets the agent routes along each of its outbound 
links. Packets can be routed according to such a proportion 
vector by taking that vector to specify the probabilities 
of routing along each outbound link. Alternatively, 
one can route traffic deterministically, in such a way 
that the proportions of the traffic actually sent are as 
close as possible, according to any of a suite of metrics, 
to the desired proportion vector. In this paper we concentrate 
on the second of these schemes (see footnote 1). Whichever 
proportion-vector-based scheme one uses, one can have 
each agent learn how best to set the vectors it uses 
(one vector for each potential ultimate destination of 
the packet), and in this way adaptively determine how 
best to do its routing. 

One difficulty with this new action space is that it 
easily results in "cycles", in which a particular packet 
may return to an agent that had previously routed it. 
To avoid this, we developed the hard masking routing 
algorithm, which maps an original proportion vector to 
a new one. This algorithm has the property that if it 
is used by every agent, then any links leaving an agent 
that could result in cycles are "masked out", so that no 
traffic is sent along that link. (In fact, in our exper- 
iments even if some agents do not use hard masking, 
if they employ conventional shortest-path routing algo- 
rithms, cycles will still be avoided.) At the same time, 
the ratios of traffic sent along all non-masked links are 
left unchanged. Hard masking requires no additional 
protocol traffic beyond that already contained in tradi- 
tional distance-vector or link-state routing algorithms 
[Q, pj. However if the agents use RL to learn propor- 
tion vectors that are run through the hard masking al- 
gorithm before being used, the system potentially can 
adapt to use far better proportions than those that arise 
(in a completely unintended manner) when one uses tra- 
ditional algorithms, while sharing with those algorithms 
the absence of any risk of cycling. 

Unfortunately, hard masking has a clipping prop- 
erty: it does not affect the amount of traffic being sent 
down a link until a certain property of that link reaches 
a threshold, at which point all traffic down that link is 
blocked. As one might expect, this hard clipping can 
reduce routing efficacy. To overcome this problem, we 
developed the soft masking routing algorithm, which 
gradually decreases traffic along a link as the threshold 
is approached, while still preventing cycles. The version 
of soft masking we use is optimal in that it is the 
unique variant of hard masking that is invariant both 
under rescaling of time and/or packet sizes, and 
under translation of the zero-point of one's clock. 

1 We have developed a particularly fast implementation of this second 
scheme, an implementation that can also have built-in "data-aging", 
so that more recent traffic is counted more heavily than older 
traffic. In addition, this second scheme can be modified to avoid 
"round-robining", so that packets do not arrive out of sequence at 
their ultimate destination. See [26] for this and other extensions of 
this scheme. 

We first investigated hard and soft masking on the 
OPNET event-driven network simulator, without any 
learning on the part of the agents; we simply swept 
through the space of potential proportion vectors, recording 
performance as we went. These experiments were 
on relatively simple and small networks (currently all 
TCP/IP-based networks are broken down for routing 
purposes into subnetworks almost always having no more 
than a dozen routers). These runs demonstrated that 
at the optimal proportion vectors, masking can result in 
throughput up to five times better than that of Bellman- 
Ford (BF), the traditional routing algorithm we used as 
our comparison point. This improvement was achieved 
even though the full BF routing protocol traffic was still 
running "in the background" in the masking systems. 
We also found that the size of the basins in the space of 
potential proportion vectors which gave at least some 
improvement over BF was quite large — approximately 
half of the range of each component of each proportion 
vector in the case of soft masking. 

As mentioned above, the second component of an 
adaptive MAS beyond having the individual agents be 
adaptive is having the agent interaction structure itself 
be adaptive, so that the agents can automatically and 
adaptively cooperate with one another. To that end, we 
considered two possible interaction structures. The first 
can be viewed as an iterative Stackelberg game struc- 
ture (|, [T^, |l6|, [l8|. In this structure, "leader" agents 
first determine what proportion vectors they will use, 
and then the "follower" agents use that information to 
determine what proportion vectors they think will result 
in optimal performance. In other words, the proportion 
vectors of the leaders are components of the input space 
of the followers. We also investigated a less asymmetric 
structure, in which agents "interleaved" their decisions 
in such a way so that every agent was in some respects 
acting as a follower and in some respects acting as a 
leader. 

We investigated how much of the potential improve- 
ment of masking over BF can be realized by having the 
agents learn their proportion vectors using these kinds 
of adaptive agent interaction structures. We found that 
even using extremely simple RL algorithms, and with 
essentially no effort given to optimizing the soft mask- 
ing, when those agents operated within the adaptive 
structure outlined above, typically throughput was 3 
times better than with BF. We never encountered an 
instance in which soft masking consistently gave worse 
performance. 

In Section 2 we describe conventional routing algorithms, 
proportional routing, and the various forms of 
masking. In Section 3 we describe the learning schemes 
used and the adaptive agent interaction schemes investigated. 
In Section 4 we present the results of our experiments. 



2 Agents for Network Routing 

2.1 Shortest Path Routing 

The most commonly used routing algorithms are based 
on the "shortest path", i.e., the path from a router to 
a destination that would experience the minimal cost 
if the traffic were routed down that path. In such al- 
gorithms each router stores the smallest of all possible 
costs to each destination, along with the first link on 
the associated path. (This data is commonly stored, 
sometimes along with other information, in a "routing 
table.") The router then sends all its packets bound 
for a particular destination along the first link on the 
associated shortest path. There are many algorithms 
for efficiently computing the shortest paths when the 
costs for traversing each router and link in the network 
are known, including Dijkstra's algorithm jl], [| g, g] 
and the BF algorithm ||[ ||[ [| 0. When applied in 
dynamic data networks of the kinds considered here, 
both algorithms entail some underlying protocol traffic, 
to allow the routing tables of the routers to adapt to 
changes in traffic conditions. 

2.2 Proportional Routing Agents 

Shortest path algorithms in general and BF in partic- 
ular have several shortcomings. In practice, the short- 
est path estimates are always based on old information, 
which means each router bases its routing decisions on 
potentially incorrect assumptions about the network. 
However even if a shortest path algorithm is provided 
the exact current costs of all the links, because it sends 
all of its traffic with the same ultimate destination down 
a single link, such an algorithm still provides subopti- 
mal solutions. (Formally, this suboptimality holds so 
long as we're not in the limit where each router makes 
infinitesimal routing decisions at each moment, with its 
routing table being updated before the next infinitesi- 
mal routing decision — see p|, [f7|, |24}.) 

This second problem with BF can potentially be alle- 
viated if each routing agent apportions its traffic bound 
for a particular destination along more than one path, 
rather than sending it all along the shortest path. In 
this paper we are concerned with agents that learn, dy- 
namically, how best to do this. As discussed above, we 
are interested in having each agent do its learning with 
an action space that consists of one proportion vector 
p satisfying: 

$$0 \le p_i \le 1, \quad i = 1, \dots, m, \qquad \text{and} \qquad \sum_{i=1}^{m} p_i = 1$$

for each destination, m being the number of outbound 
links. This proportion vector then determines how the 
traffic bound to that destination from that router gets 
apportioned among that router's outbound links, as discussed 
above. We call this "proportional routing". 

2.3 Hard Masking 

Simple proportional routing invariably results in unpro- 
ductive cycles being introduced into the paths followed 
by some packets. One way to avoid such cycles employs 
a destination-dependent ordering v(r) over all routing 
agents. Given such an ordering, we can restrict router 
r_1 to only send out packets according to its proportion 
vector along those links connecting r_1 to routers r_i such 
that v(r_i) < v(r_1); no traffic is sent along any other 
link. Assuming all routers have the same ordering v(r) 
for the same destination, having them all follow this 
scheme ensures that there will be no cycles. (In our 
work, for convenience, we choose v(r) for each destina- 
tion d to be the smallest cost estimates for going from 
r to d stored in the routing table on r.) 

We use the term "masking" to refer to any scheme of 
this nature in which the components of p are multiplied 
by constants set by the condition of the network. In 
particular, to define hard masking, let our routing agent 
be r_1, let the destination be d, let the router neighbors 
of r_1 be the {r_i}, let r_1's proportion vector be p, and 
let the ordering over routers for destination d be v(r). 
Then in the technique of hard masking we calculate an 
applied proportion vector p' from p according to the 
following formula: 

$$p'_i = \frac{p_i \, \Theta(v(r_1) - v(r_i))}{\sum_j p_j \, \Theta(v(r_1) - v(r_j))}, \qquad (1)$$

where Θ(x) is the Heaviside function that equals 1 for 
positive argument, and equals 0 otherwise. We then use 
p' rather than p to govern the routing from router r_1. 
(From now on, when we need to distinguish it from p', 
we refer to p as a base proportion vector.) 
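
The hard masking rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function name `hard_mask`, its argument names, the example numbers, and the choice to raise an error when every link is masked are all introduced here.

```python
def hard_mask(base, v_self, v_neighbors):
    """Return an applied proportion vector p' from a base vector p.

    base        -- base proportion vector p, one entry per outbound link
    v_self      -- ordering value v(r_1) of this routing agent
    v_neighbors -- ordering values v(r_i) of the neighbor on each link
    """
    # Heaviside step: keep p_i only when v(r_1) - v(r_i) is strictly positive.
    weights = [p if v_self - v > 0 else 0.0 for p, v in zip(base, v_neighbors)]
    total = sum(weights)
    if total == 0:
        # Assumption: the formula is undefined when every link is masked,
        # so this sketch reports the case instead of choosing a fallback.
        raise ValueError("all outbound links are masked")
    # Renormalize so the surviving links keep their relative ratios.
    return [w / total for w in weights]


# Hypothetical agent with three links; the second neighbor is "farther"
# from the destination than the agent itself, so that link is masked out.
applied = hard_mask([0.5, 0.3, 0.2], v_self=10.0, v_neighbors=[7.0, 12.0, 9.0])
```

In this example the masked link receives no traffic, while the ratio between the two remaining links stays at 0.5 : 0.2, as the text requires.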

2.4 Soft Masking 

Although hard masking does avoid cycles while still 
having the generic behavior of not sending all traffic 
bound for a particular destination down a single link, 
it does so in a potentially brittle manner. This is be- 
cause a link will either be used fully (according to the 
proportion vector), or, for what may only be an in- 
finitesimal change in network conditions, not used at 
all. A more reasonable strategy would be for the routing 
agent to only gradually reduce its traffic along any link 
i as that link approaches the condition v(r_1) = v(r_i), 
in such a way that p'_i = 0 when v(r_1) = v(r_i). If it 
does this, a routing agent r_1 will have essentially replaced 
hard masking's discontinuous masking function 
M_{r_1}(v(r_1), v(r_i)) = Θ(v(r_1) - v(r_i)) (i.e., the function 
that gets multiplied by p_i in the determination of 
the applied vector p') with a continuous one. 



What is the best way to implement such a "gradual 
reduction of traffic"? One obvious requirement is that 
the new masking function be both translation and scal- 
ing invariant with respect to changes to the functions 
v(.), since those functions only provide an ordering. In 
particular, in our implementation where the v(r) are 
costs given by times, we don't want either the zero- 
points or the units with which we measure time to mat- 
ter — changes to either should not affect the behavior 
of the router. 

To ensure translation invariance, it suffices to require 
that M_{r_1}(v(r_1), v(r_i)) be of the form M_{r_1}(v(r_1) - v(r_i)). 
For scaling invariance, we need to have the function M_{r_1} 
obey the following condition: 

$$\frac{M_{r_1}(x)}{M_{r_1}(y)} = \frac{M_{r_1}(ax)}{M_{r_1}(ay)}$$

for any a ≠ 0. In other words, to preserve the ratios 
of traffic sent along all links under rescaling, for any 
values x and y the ratio M_{r_1}(ax) / M_{r_1}(ay) needs to be a constant, 
independent of a. 

Now to make sure that no traffic is sent down a link 
once v(r_i) > v(r_1), write M_{r_1}(x) = N_{r_1}(x) Θ(x). (Note 
that Θ(v(r_1) - v(r_i)) is both translation and scaling 
invariant.) Restricting ourselves to the regime where 
x > 0 so that N_{r_1}(x) = M_{r_1}(x) and differentiating both 
sides of the scaling invariance condition with respect to 
a yields 

$$x \, \frac{N'_{r_1}(ax)}{N_{r_1}(ax)} = y \, \frac{N'_{r_1}(ay)}{N_{r_1}(ay)},$$

which must hold for any a, x and y. In particular, take 
a = 1, and fix x, to get the following: 

$$y \, \frac{N'_{r_1}(y)}{N_{r_1}(y)} = A, \quad \text{where } A \text{ is a constant.} \qquad (2)$$

Now define T_{r_1}(y) = ln[N_{r_1}(y)]. Having done this, 
Equation (2) becomes T'_{r_1}(y) = A/y. Integrating both 
sides, we get 

$$T_{r_1}(y) = D \ln(y) + E, \qquad (3)$$

where D and E are constants. Exponentiating both 
sides, and recalling that T_{r_1}(y) = ln[N_{r_1}(y)], we get the 
solution, writing β for the exponent D: 

$$N_{r_1}(x) = x^{\beta}. \qquad (4)$$

(The overall multiplicative constant has been set to 1; it 
is irrelevant in that it gets divided out when one divides 
by the normalizing factor to calculate p.) 
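
As a quick consistency check, one can substitute the solution back into the scaling invariance condition. The lines below are a worked step added for illustration; they read Equation (4) as the power law N_{r_1}(x) = x^β and assume a > 0 and x, y > 0, so that all the powers are well defined.

```latex
\frac{N_{r_1}(ax)}{N_{r_1}(ay)}
  = \frac{(ax)^{\beta}}{(ay)^{\beta}}
  = \left(\frac{x}{y}\right)^{\beta}
  = \frac{N_{r_1}(x)}{N_{r_1}(y)},
  \qquad a > 0,\; x, y > 0 .
```

The factor a^β cancels, so the ratio of traffic sent along two unmasked links does not depend on the units in which the costs are measured.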

Combining the two invariance properties gives us the 
final soft masking function: 

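$$N_{r_1}(x, y) = (x - y)^{\beta}. \qquad (5)$$

So for routing agent r_1, "soft masking" means that the 
applied proportion vector is set by the following: 

$$p'_i = \frac{p_i \, \Theta(v(r_1) - v(r_i)) \, (v(r_1) - v(r_i))^{\beta}}{\sum_j p_j \, \Theta(v(r_1) - v(r_j)) \, (v(r_1) - v(r_j))^{\beta}}. \qquad (6)$$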
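
A minimal Python sketch of soft masking follows, assuming the power-law weighting derived above. The function name `soft_mask`, the parameter name `beta`, and the example numbers are assumptions introduced for illustration. The second call rescales and shifts all ordering values (v becomes 60 v + 100) to illustrate the two invariance properties.

```python
def soft_mask(base, v_self, v_neighbors, beta=1.0):
    """Return an applied proportion vector using a power-law soft mask.

    Each base proportion p_i is weighted by (v(r_1) - v(r_i)) ** beta when
    that gap is positive, and by zero otherwise.
    """
    weights = []
    for p, v in zip(base, v_neighbors):
        gap = v_self - v
        weights.append(p * gap ** beta if gap > 0 else 0.0)
    total = sum(weights)
    if total == 0:
        # Assumption: as with hard masking, report the fully masked case
        # rather than inventing a fallback rule.
        raise ValueError("all outbound links are masked")
    return [w / total for w in weights]


base = [0.5, 0.3, 0.2]
original = soft_mask(base, v_self=10.0, v_neighbors=[7.0, 12.0, 9.0], beta=2.0)
# Same network after changing time units and moving the clock's zero point.
rescaled = soft_mask(base, v_self=700.0, v_neighbors=[520.0, 820.0, 640.0], beta=2.0)
```

Both calls give the same applied vector, and a link whose gap shrinks toward zero loses its traffic gradually instead of being cut off at a threshold.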

2.5 Implementation of Proportional Routing 

Perhaps the most straightforward implementation of 
proportional routing is for each routing agent to use a 
random number generator with probabilities set to the 
proportion vectors to decide where to route each succes- 
sive packet. This simple scheme has a major drawback 
however. For large numbers of packets the realized pro- 
portions of the packets actually sent will approximate 
the actual proportions arbitrarily well. However this 
is not the case when the number of packets is small. 
In particular, when masking is used, both the actual 
proportion vectors (as formally defined above) and the 
actually realized routing proportions will tend to change 
fairly frequently. A probabilistic approach may not re- 
sult in such changing proportions tracking each other 
accurately. 

To alleviate these concerns we use deterministic proportional 
routing. In this scheme, for each destination, 
each routing agent keeps track of the number of packets 
pkt_i sent through each outgoing link along with 
the total number of packets sent, pkt_total. Deterministic 
proportional routing consists of sending packets 
down the link which has the largest discrepancy between 
the desired number of packets sent through that link 
(p_i * pkt_total) and the actual number of packets sent 
through that link (pkt_i). 

Let's consider the following example to illustrate this 
method: A routing agent has three outgoing links, l_1, 
l_2, and l_3, and its current proportion vector is (0.59, 
0.31, 0.1). If this agent needs to send 10 packets before 
changing its proportion vector, it should send 6, 3 and 
1 packets respectively along each of the outbound links. 

Table 1 shows how each successive routing decision 
is made in this situation. The first column has the to- 
tal number of packets that have been routed, while the 
second column details the cumulative number of pack- 
ets that have been sent down each outgoing link. The 
third column shows the "desired" packet split at this in- 
stance, which is formed by multiplying the total number 
of packets by the proportion vector. (Note that since 
this will in general provide fractional packets, it cannot 
be the actual split.) The fourth column shows the dif- 
ference between the actual split and the desired split. 
Finally, the last column gives the largest entry of 
the fourth column, which identifies the link to which the next 
packet should be sent. 

Table 1: Deterministic Proportional Routing (all entries given for i ∈ {1, 2, 3}) 

As the splits indicate, this online method not only 
routes packets in a way that results in the optimal split 
over all 10 packets, but also selects the best split at 
each intermediate step. (Formally, one can prove this 
optimality holds for a large suite of metrics measuring 
how bad a particular discrepancy between desired and 
actual vectors is, including in particular the L2 and L1 
metrics.) 
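
The routing rule described above can be sketched as follows. This is an illustrative reading of the text rather than the fast implementation the authors mention: the function name `route_deterministically` is hypothetical, and breaking ties in favor of the lowest-numbered link (which matters for the very first packet, when every discrepancy is zero) is an assumption.

```python
def route_deterministically(proportions, n_packets):
    """Route n_packets one at a time, returning per-link counts and choices."""
    sent = [0] * len(proportions)
    choices = []
    for total in range(n_packets):
        # Desired split after `total` packets minus the split actually sent.
        gaps = [p * total - s for p, s in zip(proportions, sent)]
        # Send the next packet down the link with the largest shortfall;
        # max() returns the first such link, i.e. ties go to the lowest index.
        link = max(range(len(gaps)), key=gaps.__getitem__)
        sent[link] += 1
        choices.append(link)
    return sent, choices


# The example from the text: proportion vector (0.59, 0.31, 0.1), 10 packets.
sent, choices = route_deterministically([0.59, 0.31, 0.10], 10)
```

With these inputs the per-link counts come out as 6, 3 and 1 packets, matching the split stated in the example.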

As mentioned in the introduction, it is possible to 
implement this scheme extremely quickly, using only 
additions and pairwise comparisons. In addition, the 
scheme can be modified to allow more recent routing 
decisions to matter more than older ones, to prevent 
"round-robining" in which packets arrive out of order 
at their ultimate destination, etc. See [26]. 

3 Learning Base Proportions 

The focus of our study wasn't on finding optimal RL al- 
gorithms for routing, but rather on determining whether 
RL-based agents running in an adaptive agent interac- 
tion structure could outperform conventional routing al- 
gorithms. Accordingly, the RL algorithms we used were 
rather unsophisticated. All of them bin time into suc- 
cessive (learning) intervals. The actions (i.e., applied 
proportion vectors) of the individual routing agents are 
not allowed to change across a learning interval. These 
intervals serve as the smallest observable time unit for 
the generation of learning data. Accordingly, they need 
to be long enough to obtain an unambiguous estimate 
of what system-wide throughput would be if the actions 
currently being undertaken by the agents were contin- 
ued indefinitely, i.e., long enough to allow for the cur- 
rent proportion vectors to dominate any lingering effects 
from the previous set of proportion vectors. Conversely, 
we do not want the interval to be too long, lest it take 
too long to generate training data, and more generally 
to allow the agents to adapt to changes to network traf- 
fic. 



3.1 Learning Algorithms 

All of the RL algorithms we investigated involved the 
following three successive stages: 

1. Initialization: The agents ascertain the network 
topology. This is a conventional stage needed for 
any network to "boot up". 

2. Training: The agents explore the action space 
to collect data that will be subsequently used by 
the learners. A fixed sequence of different proportion 
vectors is applied by the set of all routing 
agents, and the associated sequence of system-wide 
throughputs for all those learning intervals 
is recorded. Each element of this sequence will 
generate an RL input/output pair for each agent. 
For each agent, for each interval, the "input" is 
the action taken by the agent together with any 
features concerning the network (e.g., proportion 
vectors of other agents) it observes during the asso- 
ciated interval, and the output is the system-wide 
throughput for that interval. This stage can be 
viewed either as part of the boot process of the 
system, which generates the initial training sets 
for the agents, or as a way of mimicking behavior 
in the middle of an ongoing system by forming a 
rough guess for the "mid-stream" training sets in 
that system. 

3. Learning: Choose actions and thereby generate 
more training input/output pairs, trading off ex- 
ploration and exploitation as one does so: 

• Immediately after the training stage, for each 
learning agent and for each destination: 
— Sweep through the possible proportion values 
(range of actions), ranging from 0 to 
1.0 in increments of .05. (In our experiments, 
m was always 2, so proportion 
vectors reduced to single-dimensional real 
numbers.) 

— For each such point, find the k nearest 
neighbors in the training set (nearest in 
the input space), and use these neighbors 
to estimate the corresponding system-wide 
throughput with a memory-based learning 
algorithm. (Examples of such a learning 
algorithm are taking a simple mean of 
those k throughputs as one's estimate, or 
forming an LMS linear fit through those k 
points.) 

— Select the point with the best estimated 
system-wide throughput and set the pro- 
portion vector to this value for the dura- 
tion of the current learning interval. 

• For subsequent learning intervals: 

— Store the input/output example generated 
in the previous learning interval. 

— Sample n values near the previous propor- 
tion vector by sampling a Gaussian cen- 
tered there. 

— For each such sample point, find its k near- 
est neighbors in the training set and use 
these neighbors to estimate the correspond- 
ing system-wide throughput as above. 

— Use a Boltzmann distribution (see footnote 2) over those 
n estimated throughputs to select a pro- 
portion vector to be applied for the dura- 
tion of the current learning interval. All 
three of k, n, and the Boltzmann temper- 
ature are held fixed throughout the run. 
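
The learning stage above can be sketched with a small memory-based learner. This is a simplified illustration, not the authors' code: it assumes scalar inputs (m = 2 and an empty feature space), uses the simple mean of the k nearest neighbors rather than a linear fit, clips sampled proportions to [0, 1], and treats higher scores as better. All function names and the synthetic training set are hypothetical.

```python
import math
import random


def knn_estimate(train, x, k):
    """Estimate throughput at input x as the mean over its k nearest neighbors.

    train is a list of (input, throughput) pairs with scalar inputs.
    """
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def initial_proportion(train, k, step=0.05):
    """First learning interval: sweep 0..1 in fixed steps, keep the best estimate."""
    grid = [i * step for i in range(int(round(1.0 / step)) + 1)]
    return max(grid, key=lambda x: knn_estimate(train, x, k))


def boltzmann_choice(candidates, scores, temperature, rng):
    """Pick a candidate with probability proportional to exp(score / temperature)."""
    best = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp((s - best) / temperature) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def next_proportion(train, previous, k, n, sigma, temperature, rng):
    """Later intervals: sample n points near the previous action, then select one."""
    # Clipping to [0, 1] is an assumption; the text does not say how
    # out-of-range Gaussian samples are handled.
    samples = [min(1.0, max(0.0, rng.gauss(previous, sigma))) for _ in range(n)]
    scores = [knn_estimate(train, x, k) for x in samples]
    return boltzmann_choice(samples, scores, temperature, rng)


# Synthetic training set whose throughput peaks at a proportion of 0.6.
train = [(i / 20, -((i / 20 - 0.6) ** 2)) for i in range(21)]
rng = random.Random(0)
start = initial_proportion(train, k=1)
follow_up = next_proportion(train, start, k=3, n=5, sigma=0.05,
                            temperature=0.01, rng=rng)
```

After each learning interval, the observed (action, throughput) pair would be appended to `train` before the next call, which corresponds to the "store the input/output example" step in the list.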

3.2 Leader-Follower Learning 

In this adaptive agent interaction structure some agents 
have empty feature spaces. Other agents include the ac- 
tions of some of the agents of the first type in their input 
spaces. Agents of this second type are called followers, 
and the agents whose actions followers include in their 
input spaces are called leaders. Note that to apply this 
scheme, at the beginning of each learning interval the 
leaders must first decide on their actions, and then the 
followers use those choices to decide on their actions. 

2 A Boltzmann distribution allows one to balance exploration 
against exploitation so that one doesn't get stuck in suboptimal parts 
of the space. It does this by selecting actions in a probabilistic manner, 
where the actions with the higher immediate payoffs have a larger 
probability of being selected. The temperature parameter of the distribution 
determines the amount of exploration performed (a low temperature 
means that the best action has a high probability of being 
selected, whereas a high temperature means that actions have a more 
even likelihood of being selected). See [29]. 

3.3 Interleaved Learning 

One potentially unsatisfactory aspect of the leader-follower 
structure is the asymmetric way it breaks down the 
agents. A less asymmetric adaptive structure has the 
agents "interleave" their decisions in such a way that 
every agent is both a leader in some respects, and a 
follower in others. In this scheme, all agents have the 
actions of other agents in their feature spaces. However 
the agents are broken into two separate groups, where 
the learning intervals of the two groups are offset from 
each other. The offset is arranged so that any learning 
interval for the first group overlaps in equal halves with 
two successive learning intervals for the second group, 
and vice-versa. (In other words, if the intervals of the 
first group extend from times 1 to 3, 3 to 5, etc., those 
of the second group extend from 2 to 4, 4 to 6, etc.) 
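
The offset between the two groups can be made concrete with a short sketch. The helper names `learning_intervals` and `overlap` are hypothetical; the numbers reproduce the example in the text, with intervals of length 2 and an offset of half an interval.

```python
def learning_intervals(first_start, length, count):
    """Return `count` consecutive (start, end) learning intervals."""
    return [(first_start + i * length, first_start + (i + 1) * length)
            for i in range(count)]


def overlap(a, b):
    """Length of the time overlap between intervals a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


group_one = learning_intervals(first_start=1, length=2, count=3)  # (1,3), (3,5), (5,7)
group_two = learning_intervals(first_start=2, length=2, count=3)  # (2,4), (4,6), (6,8)

# The middle interval of group one splits evenly across two intervals of group two.
halves = [overlap(group_one[1], other) for other in group_two[:2]]
```

Each group therefore fixes its proportion vectors halfway through the other group's interval, so every agent chooses after seeing some of the other group's choices and before others.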

4 Experimental Results 

A series of experiments was conducted using the OpNet 
discrete event network simulator (version 4.04). Each 
router in OpNet has an inbound queue and an outbound 
queue. Links between routers have infinite speed but 
limited bandwidth (1000 bits / simulated second). The 
Bellman-Ford algorithm utilized in the experiments was 
the implementation included with OpNet. 

For the experiments in this article, this time delay 
was experimentally determined to be 500 time units on 
the network discussed in Section 4.1. For the exploration 
step in the learning interval discussed in Section 3.1, 
a Gaussian centered on the current proportion 
vector with a standard deviation of .0025 was used 
to generate 5 new proportion vectors. The Boltzmann 
temperature used to select over those new proportion 
vectors was 3000. 

4.1 Network Description 

The "Gemini" network shown in Figure 1 was used for 
testing the various routing approaches. In our experiments, 
routers S_1 and S_2 are the sources where all packets 
are generated. Nodes D_1 and D_2 are the possible 
(ultimate) destination nodes. Packets generated at S_i 
are sent to D_i (for i = 1, 2). The traffic stream was 
simulated by generating packets (consisting of 1000 bits 
each) at both source nodes with the time between suc- 
cessive packets determined by randomly sampling the 
intervals [.24s, .26s] and [.28s, .30s], respectively. The 
intermediate (non-source) routers have their proportion 
vectors set to direct 90% of their packets forward toward 
the appropriate destination nodes. 

4.2 Experimental Setup 

The performance of each routing method was evalu- 
ated over 40 trials. The initialization stage was 20s 
long, the training stage then ran from time t = 20s to 
t = 15,000s, and the learning stage from t = 15,000s to 
t = 40,000s. The total delay was measured as the sum 
of the delays of all packets generated during the learning 
period. The simulation continued for 5000s beyond 
the learning stage to allow the packets generated in the 
learning stage to reach their destinations. During that 
time the source nodes continued to generate new pack- 
ets in order to maintain stationary conditions for the 
packets in transit. Those packets generated during this 
time were not included in the calculation of total delay 
however. 




Figure 1: The GEMINI network 



4.3 Results and Analysis 

The results of the experiments are summarized in Table 2. 
The first row contains the performance of BF. 
The row labeled "BF source only" reports performance 
when the source nodes used BF and the intermediate 
nodes operated with soft masked proportional routing. 
This allows a direct comparison of the effects of replac- 
ing BF with memory-based learning methods at the 
source nodes. The third row of the table summarizes 
performance when the ideal proportion vector is used. 
(Those vectors were ascertained by exhaustively run- 
ning through a suite of simulations in each of which 
the proportion vector never changed in the "learning 
period".) Under the rough assumption that these num- 
bers constitute an upper bound on performance with 
the learners, we can use these numbers to provide us 
with the "headroom" of each algorithm, that is with 
the amount by which each algorithm's performance falls 
short of the best possible performance. 

The algorithms used by the RL-based methods are 
presented next. The type of fit was linear based on the 
12 nearest neighbors. The learners reduced the per- 
formance headroom between Bellman-Ford and using 
the ideal proportions by 93-95%. Clearly, the agent- 
based approaches benefit from using the more sophis- 
ticated throughput estimates. Comparing the agent- 
based approaches to one another, leader-follower and 
interleaved learning reduce the performance headroom 
between the standard learner and the ideal proportions 
by 25%. Thus, the agent-based approaches where one 
or both of the learners have knowledge of the intentions/actions 
of the other agent have significantly better 
performance than the standard learner. 
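
One plausible way to read the "headroom reduction" percentages quoted above is as the fraction of the gap between a baseline and the ideal proportions that a method closes. The sketch below encodes that reading. The formula is an assumption (the text does not state it explicitly), the function name is hypothetical, and the numbers are made up for illustration; they are not the experimental results.

```python
def headroom_reduction(baseline, method, ideal):
    """Fraction of the baseline-to-ideal gap that `method` closes.

    All three arguments are values of the same performance measure
    (for example total delay, where lower is better).
    """
    gap = baseline - ideal
    if gap == 0:
        raise ValueError("baseline and ideal coincide; there is no headroom")
    return (baseline - method) / gap


# Hypothetical total delays in arbitrary units, NOT taken from the experiments.
example = headroom_reduction(baseline=100.0, method=24.0, ideal=20.0)
```

Under this reading, a value of 0.95 would mean the method recovers 95% of the improvement that the ideal proportions offer over the baseline.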

5 Discussion 

Adaptivity is a feature of a MAS that becomes increas- 
ingly important the larger the MAS and the less reliable 
the environment in which it operates. Broadly speak- 
ing, adaptivity takes two forms: adaptivity of the indi- 
vidual agents, and adaptivity of the interaction struc- 
ture among the agents. We have investigated both 
forms of adaptivity in the important context of rout- 
ing over networks. In a set of experiments we found 
that simply modifying the action spaces of the agents 
to make them better suited to adaptive algorithms po- 
tentially improved throughput by up to a factor of 3.5 
over the traditional Bellman-Ford algorithm. We then 
investigated two schemes for how to have the agent in- 
teraction structure itself be adaptive. We found that 
these schemes both realized a significant fraction of this 
potential improvement, with an improvement factor of 
3 over Bellman-Ford. Furthermore, a 25% improve- 
ment was observed over an agent-based approach with 
no adaptive interaction structure. 